grep -l @kv cache /news/*.json | wc -l → 12

KV cache

mentions: 12 | type: Person

// recent coverage 12 mentions

16:00

2026-06-25

newsroom.arm.com

artificial-intelligence

From host node to heterogeneous rack: Rethinking the AI CPU

AI infrastructure is entering a new phase focused on rack-scale system composition for agentic AI workflows, where CPUs play critical orchestration roles alongside accelerators. The shift from single-…

04:04

2026-06-24

pub.towardsai.net

artificial-intelligence

The Governance of Reasoning

AI engineering faces a contradiction between paying a premium for frontier models' reasoning capabilities and aggressively compressing context to reduce costs, leading to a 'fallacy of context compactio…

09:21

2026-06-18

discuss.huggingface.co

artificial-intelligence

Shannon Prime Lattice

Researchers developed XBAR, an auditable latent crossbar memory fabric that enables model-to-model communication by writing directly into a frozen transformer's KV cache. The system achieves O(1) VRAM…

03:56

2026-06-17

dev.to

large-language-models

How much VRAM do you actually need to run Llama 3 or Gemma locally?

A developer calculated the actual VRAM requirements for running Llama 3 8B and Gemma 2 9B locally, revealing that the KV cache can consume far more memory than the model weights, especially at longer …
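
As a rough illustration of why the KV cache grows with context length, the usual back-of-the-envelope estimate can be sketched as below. This is not the article's own calculation: the function name `kv_cache_bytes` is hypothetical, and the Llama 3 8B figures used (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions taken from the commonly published model configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed Llama 3 8B configuration (not taken from the article).
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=1)
at_8k = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(per_token // 1024, "KiB per token")   # 128 KiB per token
print(at_8k / 2**30, "GiB at 8192 tokens")  # 1.0 GiB at 8192 tokens
```

The estimate scales linearly with both sequence length and batch size, so doubling either doubles the cache, while the weights stay fixed.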

16:12

2026-06-16

twitter.com

artificial-intelligence

GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

A developer implemented a full Transformer with KV cache on an FPGA, achieving over 56,000 tokens per second at only 80 MHz, without using a GPU or CPU. The design was created gate by gate as a custom…

08:45

2026-06-16

thecomputersciencebook.com

large-language-models

PagedAttention is more than virtual memory

PagedAttention, a memory optimization technique in the vLLM inference server, applies virtual memory concepts to manage the KV cache in large language models, improving throughput by reducing fragment…
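
The virtual-memory analogy can be sketched as a per-sequence block table that maps logical token positions to fixed-size physical blocks. The sketch below is a toy under stated assumptions, not vLLM's actual API: the class `BlockAllocator` and its methods are hypothetical names, and it tracks only block bookkeeping, not the tensors themselves.

```python
class BlockAllocator:
    """Toy paged KV cache bookkeeping (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            # The last block is full (or the sequence is new): take a new block.
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
slots = [alloc.append_token("a") for _ in range(3)]
print(slots)            # [(3, 0), (3, 1), (2, 0)]
alloc.release("a")
print(len(alloc.free))  # 4
```

Because a sequence only ever wastes the unused tail of its last block, memory lost to fragmentation is bounded by one block per sequence rather than by a worst-case contiguous reservation.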

11:34

2026-06-13

vettedconsumer.com

artificial-intelligence

Show HN: Quant Picker – which GGUF file fits your model and machine

Quant Picker is a new tool that calculates which GGUF quantization level fits a given model and machine, balancing file size, quality, and context budget. It recommends the highest quantization that l…

00:20

2026-05-26

ranvier.systems

large-language-models

Tokenization Is the Bottleneck You're Not Measuring

A hidden bottleneck in LLM proxy architectures is causing 5-13 millisecond blocking delays per request during tokenization, a CPU-bound operation that most systems treat as instantaneous. In event-loo…

08:34

2026-05-23

dev.to

large-language-models

We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.

The authors replaced their RAG pipeline with a persistent KV cache system, which stores the full document's attention state after a single prefill and reuses it for every query. …

11:37

2026-05-21

dev.to

large-language-models

End-to-End Observability for vLLM and TGI: from DCGM to Tokens

Running large language model inference servers like vLLM and TGI in production requires specialized observability because they behave differently from standard web services, with key metrics like late…

06:20

2026-05-20

dev.to

large-language-models

KV Cache Explained Like You're an LLM Engineer

The KV cache is a critical optimization for LLM inference that stores the Key and Value matrices from previously generated tokens, eliminating the need to recompute attention over the entire sequence …
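
The idea can be sketched for a single attention head in plain Python. This is a minimal illustration under simplifying assumptions, not the article's code: the names `KVCache`, `attend`, and `decode_step` are hypothetical, and the query, key, and value vectors are passed in directly rather than computed by learned projections.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all stored keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                              # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Keys and values of all tokens processed so far (one head, one layer)."""

    def __init__(self):
        self.keys = []
        self.values = []


def decode_step(cache, q, k, v):
    """Append the new token's key/value, then attend over the whole cache.

    Only the new token's k and v are supplied at each step; earlier ones
    are read back from the cache instead of being recomputed.
    """
    cache.keys.append(k)
    cache.values.append(v)
    return attend(q, cache.keys, cache.values)


cache = KVCache()
first = decode_step(cache, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0])
second = decode_step(cache, q=[0.0, 1.0], k=[0.0, 1.0], v=[4.0, 5.0])
print(first)            # [2.0, 3.0]
print(len(cache.keys))  # 2
```

Each step does work proportional to the current sequence length, at the cost of keeping every past key and value in memory.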

00:00

2026-05-14

huggingface.co

machine-learning

Unlocking asynchronicity in continuous batching

Synchronous continuous batching in LLM inference causes inefficiency by forcing the CPU and GPU to work sequentially, leaving one idle while the other operates. This idle time can account for nearly a…

// co-occurs with top 8 entities

vLLM (3), PagedAttention (2), GPU (2), FPGA (1), Transformer (1), microGPT (1), Andrej Karpathy (1), Quant Picker (1)